Abstract


This paper describes our participation (the HEIR team) in the TREC 2012 Microblog
Adhoc Track. We propose a ranking algorithm that incorporates query-based temporal information.
A growing body of research shows that time is an important factor for improving search
results, especially in microblog search. Within the language modeling framework, representative
work uses time information as the document prior. There are two ways to use this feature:
one is query-dependent and the other is query-independent. Both rest on the hypothesis that
"the newer the document, the more important it is". However, different queries have different
hot time points (the peaks of the time distribution of their relevant documents). Taking this
into consideration, we propose four models based on hot time points (HTLM). On this basis,
we treat the query-independent model as the document's background information and the
query-dependent model as the document's query-specific information, and combine them through
smoothing into a mixed timed language model. The results suggest that the HTLM models are
more effective for microblog search and that the mixed model further improves on the single
models.
ation  retrieval 


1  Introduction 


Microblogging, as a type of social media, has become immensely popular in recent years.
There are many microblog websites, such as Twitter. Compared with web queries, Twitter
queries more often reflect an intent to find fresh information [1]. Such queries can be
called time-aware queries. In our work, we use time information within the language modeling
framework to improve the effectiveness of microblog retrieval.

Within the language modeling framework, one way of using the time feature is as a document
prior, defined through a function of the document's time. These methods can be divided into
two kinds according to whether they depend on the query. Li and Croft [2] assumed that a newer
document is more likely to be read by the user, and therefore incorporated an exponential
decay into the language model as the document prior, with a manually set rate parameter. This
is a query-independent model. Efron and Golovchinsky [3] extended the work of Li and Croft [2]
and proposed a time-aware language model that uses query information. They argued that the
rate parameter of the exponential distribution should differ from query to query.

Previous work assumed that, given a query, a document is more important the fresher it is.
However, this hypothesis does not always hold for time-sensitive queries. As shown in Fig. 1,
different queries have different time distributions of relevant documents, a phenomenon that
was also noted in [2]. In this paper, we build on these findings. The peak points of the
distribution are defined as the "Hot Time Points of the Query", meaning that relevant
documents are more likely to occur at these moments. We propose four time-aware language
models, each of which relies on hot time points (HTLM). The HTLM models are query-dependent.
From another perspective, a document carries two kinds of time information: background
information and query-specific information. Both are important for document ranking, so we
build a mixed model using a smoothing method and show experimentally that the mixed model is
more effective.

Fig. 1  Relevant Documents' Time Distributions of Queries from TREC 2011 Microblog


2  Related  Work 


2.1  Language  Model 


Language modeling was first applied to information retrieval by Ponte and Croft [4] in 1998.
In this paper, we rely on the query likelihood model. Given a document d and a query q, the
ranking function (Eq. 1) is the posterior probability that the document's multinomial
language model generated the query [5]. The prior p(d) in Eq. 1 is usually set to a fixed
value or ignored.

    p(d \mid q) \propto \log p(d) + \sum_{w \in V} tf(w, q) \, \log p(w \mid m_d)    (1)

To avoid zero probabilities, Zhai and Lafferty [6] studied several smoothing methods. The
simplest is Jelinek-Mercer smoothing, which gives:

    p(w \mid m_d) = (1 - \lambda) \, p_{ml}(w \mid m_d) + \lambda \, p_{ml}(w \mid m_c)    (2)
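
As an illustration, Eq. 1 and Eq. 2 can be combined in a few lines of Python. This is only a minimal sketch: the function names, the bag-of-words token lists and the default value of `lam` (taken from the baseline setting reported in Section 4.1) are assumptions made for the example, not details of our implementation.

```python
import math
from collections import Counter


def jm_prob(word, doc_tf, doc_len, coll_tf, coll_len, lam=0.4):
    """Eq. 2: (1 - lam) * p_ml(w | document) + lam * p_ml(w | collection)."""
    p_doc = doc_tf.get(word, 0) / doc_len if doc_len else 0.0
    p_coll = coll_tf.get(word, 0) / coll_len
    return (1.0 - lam) * p_doc + lam * p_coll


def query_log_likelihood(query_terms, doc_terms, coll_tf, coll_len,
                         lam=0.4, log_prior=0.0):
    """Eq. 1: log p(d) + sum over query words of tf(w, q) * log p(w | m_d)."""
    doc_tf = Counter(doc_terms)
    score = log_prior
    for word, tf_q in Counter(query_terms).items():
        p = jm_prob(word, doc_tf, len(doc_terms), coll_tf, coll_len, lam)
        if p == 0.0:
            # The word does not occur anywhere in the collection.
            return float("-inf")
        score += tf_q * math.log(p)
    return score
```

The `log_prior` argument is where a time-based document prior, such as those discussed in Section 2.2, would enter the score.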

2.2  Timed  Language  Model 

There are two ways to incorporate a document's creation date into the language model as a
document prior. One is independent of the query: the time distribution of the collection is
used as the document's background information. Li and Croft [2] used an exponential
distribution over the document's creation date (Eq. 3), under which newer documents receive
higher scores.

    P(d) = P(d \mid t_d) = \gamma e^{-\gamma (T_c - T_d)}    (3)

Here T_c is the newest date in the collection, T_d is the creation date of document d, and
γ is the rate parameter of the exponential distribution. We use LC as shorthand for this
model.
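
A minimal sketch of the prior in Eq. 3, computed in log space so that it can be added directly to the query log-likelihood. Measuring time in days and the function name are assumptions made for this example.

```python
import math


def recency_log_prior(doc_day, newest_day, gamma=0.3):
    """Log of Eq. 3: log(gamma) - gamma * (T_c - T_d), with times in days."""
    age = newest_day - doc_day
    if age < 0:
        raise ValueError("document is newer than the reference time T_c")
    return math.log(gamma) - gamma * age
```

With a fixed `gamma`, a document published on the newest day gets the largest prior, and the prior decays exponentially with the age of the document.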

Efron and Golovchinsky [3] extended Li and Croft's work. They pointed out that a document's
prior is related not only to its publication time but also to the specific query, so they
replace γ with a query-specific rate γ_q. For a query q, let the pseudo-relevant document set
be P = {d_1, d_2, ..., d_k} and the time set of these documents be T = {t_1, t_2, ..., t_k}.
The rate γ_q is computed as:

    \gamma_q^{ml} = 1 / \bar{T}    (4)

Here \bar{T} is the sample mean of T. In the following, we refer to this algorithm as EGML.
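
The estimate in Eq. 4 can be sketched as follows. The sketch assumes that each t_i is expressed as an age in days relative to the newest date T_c, which is one plausible reading of the time set T; the function name and this convention are assumptions for illustration.

```python
def query_rate_ml(doc_days, newest_day):
    """Eq. 4: query-specific rate as 1 / (sample mean of the document ages)."""
    if not doc_days:
        raise ValueError("no pseudo-relevant documents")
    ages = [newest_day - day for day in doc_days]
    mean_age = sum(ages) / len(ages)
    if mean_age <= 0:
        raise ValueError("mean age must be positive")
    return 1.0 / mean_age
```

Under this reading, a query whose pseudo-relevant documents are mostly recent gets a large rate (steep decay), while a query whose documents are spread over older dates gets a small rate.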

3  Time-aware  Mixed  Language  Model 

3.1  Query  Analysis 


Li and Croft [2] split queries into two categories: time-sensitive queries, which clearly
have more relevant documents in a specific period, and all other queries. Microblog queries
are almost always time-sensitive [1]. We used the 50 topics of the TREC 2011 Microblog Adhoc
Track for analysis. Fig. 1 shows the time distributions of relevant documents for four
example topics (MB001, MB009, MB024, and MB045). EGML assumes that a newer document is more
important for a specific query. However, this assumption does not always hold. In Fig. 1,
most relevant documents of the four queries do not fall on the newest day; for example, the
peak is on the fourth day for query MB001 and on the fifth day for query MB045. Specifically,
for query MB001, EGML promotes documents published on the fourth and seventeenth days, where
there are few relevant documents. If we ranked according to this assumption, the truly
relevant documents would be penalized, leading to poor search results.

In response to this phenomenon, we hypothesize that a document is more likely to be relevant
when its time is closer to the query's hot time:

Query's Hot Time: Given a query q, the set of peak points of the time distribution of its
relevant documents is called the query's hot time. In particular, the points that are
relatively high compose the set.

After defining the query's hot time, the parameter T_c in Eq. 3 takes a different value for
different queries instead of the newest time in the collection. The intuition is that
documents near the query's hot time will be ranked higher.

3.2  Hot  Time  Language  Model  (HTLM) 


The analysis of the distribution of queries' relevant documents in Section 3.1 shows that,
for time-sensitive queries, the peak points, which we call the query's hot time, differ from
query to query. Under this hypothesis, the challenge is to obtain the query's hot time when
the relevant documents are not known. Pseudo relevance feedback is a popular way in
information retrieval to stand in for the relevant documents. In the present study, we take
the documents returned by a first retrieval pass with the query likelihood language model
(Eq. 1) as the pseudo-relevant documents. The time distributions of the queries' top 500
documents are shown in Fig. 2.

Fig. 2  Pseudo Relevant Documents' Time Distributions of Queries from TREC 2011 Microblog

Fig. 2 shows that the peak points of the pseudo-relevant documents' time distributions are
almost the same as those of the real relevant documents' distributions, which means we can
use the pseudo peak points in place of the real ones. According to our statistics, for 21
queries the pseudo peak point is the same as the real peak point, and for 14 queries the two
differ by only one day. We therefore use the pseudo peak point as an approximation of the
query's real hot time.
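
The pseudo peak point can be read off a per-day histogram of the top-ranked documents. The following is a minimal sketch; the day-level granularity, the function name and the tie-breaking toward the earliest day are assumptions made for the example.

```python
from collections import Counter


def peak_day(doc_days):
    """Return the day with the most pseudo-relevant documents.

    doc_days holds one integer day index per retrieved document.
    Ties are broken toward the earliest day (an assumption of this sketch).
    """
    if not doc_days:
        raise ValueError("no pseudo-relevant documents")
    hist = Counter(doc_days)
    return min(hist, key=lambda day: (-hist[day], day))
```

In practice `doc_days` would hold the publication days of the top 500 documents returned by the first retrieval pass.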

We define the following symbols: given a query q, the pseudo document set is
P_q = {d_1, d_2, ..., d_k}, and T = {t_1, t_2, ..., t_k} is the time set of P_q. Setting
T_c = T_cq = max(T), the peak point of the time distribution of T, the two methods of
computing the parameter γ give two models: in one, γ is manually specified as in Eq. 3
(HTLM-LC); in the other, it is estimated as in Eq. 4 (HTLM-ML).

Further, by comparing the four topics in Fig. 2, we find that different topics have
different numbers of peak points. In particular, query MB009 has one dominant peak, while
query MB045 has four relatively high points. This finding is also in line with the
observations in Li and Croft's paper [2]. We therefore introduce the query's hot time set
HotTcqs = {ht_1, ht_2, ..., ht_hn}, with ht_1 < ht_2 < ... < ht_hn, built by the following
rule:

Rule: Let MaxDN be the highest document count in the time distribution. A time point is
added to the hot time set HotTcqs if and only if its document count is greater than
α * MaxDN.

We introduce a variable α, with values in 0.0-1.0, which controls the degree of choosing
hot times. After obtaining the HotTcqs of query q, we define T_cqd:

    T_c = T_{cqd} = \{ ht_i : T_{cq} > ht_{i-1} \text{ and } T_{cq} \le ht_i \}

Applying the two methods of computing the parameter γ again gives two further models
(HTLM-AdaptiveMultiLC and HTLM-AdaptiveMultiML).

In total, corresponding to the characteristics of the query, we propose four models based
on the query's hot time. The training of the parameters is described in Section 4.
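
The hot time selection rule of this section can be sketched in Python as follows. Two points are assumptions of the sketch rather than statements about our implementation: the interval condition is read as applying to the document's own publication day, so that each document is matched to the first hot time at or after it, and documents published after the last hot time fall back to that last hot time. The function names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def hot_time_set(doc_days, alpha=0.8):
    """Sorted days whose document count exceeds alpha * MaxDN.

    The top peak itself is always kept, so the set is never empty.
    """
    if not doc_days:
        raise ValueError("no pseudo-relevant documents")
    hist = Counter(doc_days)
    max_dn = max(hist.values())
    return sorted(day for day, n in hist.items()
                  if n == max_dn or n > alpha * max_dn)


def reference_hot_time(doc_day, hot_times):
    """First hot time ht_i with ht_(i-1) < doc_day <= ht_i.

    Assumption: a document newer than every hot time uses the last one.
    """
    i = bisect_left(hot_times, doc_day)
    return hot_times[i] if i < len(hot_times) else hot_times[-1]
```

A larger `alpha` keeps only the most pronounced peaks, while a smaller value admits secondary peaks into the hot time set.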


3.3  Mixed  Time  Language  Model 


As mentioned previously, existing work can be classified into two kinds: models based only
on the whole collection, denoted P(d,t), and models that depend on the specific query,
denoted P(d,q,t). A model without query information defines the relationship between the
document and time only; this can be seen as the document's background time information.
P(d,q,t), in contrast to P(d,t), takes into account the specific information contained in
the query; this can be seen as the document's query time information.

Given a query q and a document d, both the background and the query time information of
the document are indispensable. We therefore use a smoothing strategy to combine the two
kinds of information and propose a mixed time language model (MixTimedLM):

    P(d) = \omega P(d, q, t) + (1 - \omega) P(d, t)    (5)

where ω is a mixing parameter that controls the degree of smoothing. Experimental results
show that the mixed model improves further on the single models.
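
A minimal sketch of the mixture, using an exponential prior for both components. Taking the absolute distance to the reference day (so that documents on either side of a hot time are penalized) is an assumption of this sketch, as are the function names and the default parameter values.

```python
import math


def exp_prior(doc_day, ref_day, gamma):
    """Exponential prior around a reference day.

    Assumption: the absolute distance |ref_day - doc_day| is used.
    """
    return gamma * math.exp(-gamma * abs(ref_day - doc_day))


def mixed_log_prior(doc_day, newest_day, hot_day,
                    gamma_bg=0.3, gamma_q=0.5, omega=0.2):
    """Log of the mixture: omega * P(d,q,t) + (1 - omega) * P(d,t)."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    p_background = exp_prior(doc_day, newest_day, gamma_bg)  # P(d,t)
    p_query = exp_prior(doc_day, hot_day, gamma_q)           # P(d,q,t)
    return math.log(omega * p_query + (1.0 - omega) * p_background)
```

Setting `omega` to 0 recovers the background (LC-style) prior alone, and setting it to 1 recovers the query-specific hot time prior alone.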

4  Experimental  Design  and  Results 


4.1  Experimental  Data  and  Parameter  Value 


The Twitter data was preprocessed by removing RT tweets and non-English tweets, removing
the @ and URL information from the tweet content, and stemming the content with the Porter
stemmer. After preprocessing, the whole collection consists of 9679710 tweets. There are
50 topics in TREC 2011 and 38079 labeled tweets.
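
Part of this preprocessing can be sketched with regular expressions. The patterns below are assumptions made for illustration: a retweet is taken to be a tweet starting with "RT", and the @ information is taken to mean whole user mentions. Language filtering and Porter stemming are left out of the sketch.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
RT_RE = re.compile(r"^\s*RT\b")


def clean_tweet(text):
    """Return the cleaned tweet text, or None if the tweet is a retweet."""
    if RT_RE.match(text):
        return None
    text = URL_RE.sub(" ", text)      # drop URLs
    text = MENTION_RE.sub(" ", text)  # drop @ mentions
    return " ".join(text.split())     # normalize whitespace
```

Tweets for which `clean_tweet` returns `None` would be dropped from the collection before indexing.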

We use MAP and P@30 over the top 1000 results as evaluation measures, with P@30 as the main
measure. The baseline is the query likelihood language model (QL) with smoothing parameter
λ = 0.4. On the preprocessed test collection, we tuned the parameters of the above models
using 5-fold cross validation. The chosen parameter values are shown in Tab. 1.
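
For reference, the two measures can be computed per topic as sketched below; MAP is then the mean of the average precision over all topics. The function names are assumptions, and the official evaluation tool may treat edge cases differently.

```python
def precision_at_k(ranked_ids, relevant_ids, k=30):
    """Fraction of the top k results that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


def average_precision(ranked_ids, relevant_ids, cutoff=1000):
    """Average precision of one ranked list, truncated at the cutoff."""
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)
```

Here `ranked_ids` is the system's ranked list of tweet identifiers for a topic and `relevant_ids` is the set of tweets judged relevant for it.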


Tab. 1  Model Parameters' Values Chosen for Best Performance (P@30)

Model                  Parameter  Description                                      Value
LC                     γ          the rate parameter of exponential distribution   0.3
HTLM-LC                γ          the rate parameter of exponential distribution   0.2
HTLM-AdaptiveMultiLC   γ          the rate parameter of exponential distribution   0.5
                       α          the degree parameter for choosing hot time       0.8
HTLM-AdaptiveMultiML   α          the degree parameter for choosing hot time       0.9

4.2  Timed  Language  Model  Experiment 


Table 2 shows the search results of these timed models on the training set. Compared with
the baseline, LC and EGML both improve MAP but lower P@30, whereas all HTLM models improve
P@30, and HTLM-LC and HTLM-AdaptiveMultiML also improve MAP. The improvement in P@30 may be
due to the higher importance given to documents near the query's hot time: because they are
close in time to more relevant tweets, these tweets are promoted in the ranked list.
Meanwhile, some non-relevant documents also exist around the hot time, and their positions
may be raised as well; we will study this in future work.


Tab. 2  Retrieval Effectiveness on TREC 2011 Microblog Dataset

Type       Model                  P@30     MAP
Baseline   QL                     0.3252   0.3099
P(d,t)     LC                     0.3244   0.3168
P(d,q,t)   EGML                   0.3238   0.3178
           HTLM-LC                0.3327   0.3146
           HTLM-ML                0.3347   0.3023
           HTLM-AdaptiveMultiLC   0.3354   0.3038
           HTLM-AdaptiveMultiML   0.3367   0.3142

Next we test the performance of the mixed time language model. Following Table 1 and
Eq. 5, for each of P(d,t) and P(d,q,t) we choose the parameter settings with the top 10
P@30 values as the candidate model set. Table 3 shows the top 3 mixed models by P@30 and
by MAP, where ω is the smoothing degree in Eq. 5.


Tab. 3  Mixed Time-Aware Model's Retrieval Effectiveness on TREC 2011 Microblog Dataset

             P(d,t)        P(d,q,t)                                 ω     P@30     MAP
P@30 Top 3   LC, γ = 0.2   HTLM-AdaptiveMultiLC, γ = 0.8, α = 0.9   0.2   0.3374   0.3172
             LC, γ = 0.3   HTLM-AdaptiveMultiML, α = 0.6            0.5   0.3374   0.3127
             LC, γ = 0.3   HTLM-AdaptiveMultiML, α = 0.9            0.5   0.3374   0.3075
MAP Top 3    LC, γ = 0.3   HTLM-AdaptiveMultiLC, γ = 0.3, α = 0.8   0.1   0.3320   0.3217
             LC, γ = 0.3   HTLM-AdaptiveMultiLC, γ = 0.3, α = 0.4   0.1   0.3306   0.3216
             LC, γ = 0.3   HTLM-AdaptiveMultiLC, γ = 0.3, α = 0.5   0.1   0.3299   0.3216

In Table 3, the retrieval performance of the mixed model is better than that of QL, LC and
EGML on MAP and P@30, respectively. The P(d,q,t) components are all from the
HTLM-AdaptiveMulti series, which indicates that the hot time models are effective.

4.3  Parameter  Sensitivity  of  the  Models 


Two parameters are defined and set in this paper (see Table 1). One is the rate parameter
of the exponential distribution, γ. The other is the degree parameter for choosing hot
times, α. We discuss them in turn below.

For parameter γ, the values range over 0.01-0.1 in steps of 0.01 and over 0.1-1 in steps
of 0.1. Figure 3 shows how performance changes with the parameter.

Fig. 3  Performance of Models (LC, HotTLC, AdaptiveMultiHotTLC) when parameter γ changes


From Figure 3, we can see that almost every model reaches its highest point for γ between
0.1 and 0.5. As γ increases from 0.1 to 1, the performance of the LC algorithm declines
rapidly, while that of the HotTLC and AdaptiveMultiHotTLC algorithms declines slowly.
Hence, we conclude that the hot time based models are less sensitive to this parameter.

For parameter α, the values range over 0.1-1 in steps of 0.1. The experimental results
are shown in Figure 4.

Fig. 4  Performance of Models (AdaptiveMultiHotTLC, AdaptiveMultiHotTML) when parameter α changes

Figure 4 shows that the curves change relatively smoothly as the parameter α changes. The
P@30 curve improves almost monotonically. The reason is fairly intuitive: as α increases,
documents published outside hot times are excluded, which reduces noise and improves
search performance. The MAP curve, however, does not behave as we expected. We expected
performance to increase with α and to reach its optimum when α is 0.8 or 0.9, but this is
not the case. A possible reason for this difference is the degree of curve fitting. In
detail, the AdaptiveMultiHotTLC algorithm uses the same γ for all topics, so hot times of
different degrees receive the same score, which leads to different results for different
topics; the curve of AdaptiveMultiHotTLC therefore fluctuates. For AdaptiveMultiHotTML,
MAP decreases as α increases. A possible reason is that MAP is highest when α is 0.1
because the distribution of the time score then almost fits the time distribution of the
pseudo documents, which in turn is almost the same as the distribution of the real
relevant documents.

In summary, the rate parameter of the exponential distribution, γ, has less effect on MAP
and P@30. The degree parameter for choosing hot times, α, has a stable effect on P@30 but
an unstable effect on MAP.


5  Conclusion  and  Future  Work 


One of the core problems in microblog retrieval is how to incorporate time information
into retrieval. Within the language modeling framework, defining the document prior from
its creation time is a commonly used method. There are two kinds of models, distinguished
by whether they depend on the query. Both are based on the hypothesis that a more recently
published document is more important. However, by observing the queries, we find that this
assumption is not suitable for every topic. Different queries have different hot times, at
which there are more relevant documents. If we assume that a newer document has a higher
prior score, the ranking positions of relevant documents are harmed whenever most relevant
documents were not published at the newest time. We therefore propose the hot time
language model series (HTLM series), which allows different topics to have different
numbers of hot times. In total, we define four models with two parameters.

As stated above, there are two kinds of models: one is based on the whole collection, and
the other depends on the query. The HTLM series belongs to the second kind. A model that
uses only whole-collection information can be seen as the document's background
information, and a model that uses query information as the document's information
specific to the query. Given a document d and a query q, the relationship between the
document and time should therefore combine these two kinds of information. Accordingly, we
propose a mixed time language model using a smoothing strategy and show experimentally
that the mixed model is better than the single models.

In the future, several points can be studied: 1) the exponential distribution is the most
popular way to define the relationship between document and time, but whether it is the
best distribution needs more work; 2) tweets have many features that web pages do not
have, for example hashtags, and our next step is to incorporate these tweet features with
time information to define the microblog prior; 3) as mentioned above, raising the
importance of a time point brings in relevant documents but also some non-relevant
documents, and we will try to reduce the number of such non-relevant documents.